BIG DATA ANALYTICS
MSCIT PART 1 SEM II

INDEX

Practical No. | Title | Date | Sign
1. Installation of Hadoop on Windows 10.
2. Exploring Hadoop Distributed File System (HDFS).
3. Store big data in HBase / MongoDB / Pig using Hadoop / R.
4. K-means clustering.
5. Apriori algorithm using Groceries dataset.
6. Linear regression Model.
   a. Simple Linear regression.
   b. Logistic regression.
7. a. Program to draw Decision Tree.
   b. Naïve Bayes Classification.
8. Text Analysis.
9. Classification model.
10. Implement SVM Classification Technique.
11. Clustering Model
    i. Clustering algorithms for unsupervised classification.
    ii. Plot the cluster data using R visualizations.

PRACTICAL NO: 01

Aim: Hadoop Installation on Windows 10.

Prerequisite: To install Hadoop, you should have Java version 1.8 installed on your system. Check your Java version with this command at the command prompt:

java -version

Create a new user variable. Set the Variable_name to “HADOOP_HOME” and the Variable_value to the path of the bin folder where you extracted Hadoop.

Likewise, create a new user variable with the variable name “JAVA_HOME” and the variable value set to the path of the bin folder in the Java directory.

Now we need to add the Hadoop bin directory and the Java bin directory to the system variable Path. Edit Path in the system variables, click on New, and add the bin directory paths of Hadoop and Java to it.

Configurations

Now we need to edit some files located in the etc/hadoop directory of the folder where we installed Hadoop. The files that need to be edited are listed below.

1. Edit the file core-site.xml in the hadoop directory. Copy this XML property into the configuration element of the file.
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

2. Edit mapred-site.xml and copy this property into the configuration.

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

3. Create a folder “data” in the hadoop directory.

4. Create a folder named “datanode” and a folder named “namenode” in this data directory.

5. Edit the file hdfs-site.xml and add the below property in the configuration.

Note: The namenode and datanode paths in the value elements must be the paths of the namenode and datanode folders you just created.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\hadoop-3.3.0\data\namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\hadoop-3.3.0\data\datanode</value>
  </property>
</configuration>

6. Edit the file yarn-site.xml and add the below property in the configuration.

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

7. Edit hadoop-env.cmd and replace %JAVA_HOME% with the path of the Java folder where your JDK 1.8 is installed.

8. Hadoop needs Windows-specific binaries which do not come with the default download of Hadoop. To include those files, replace the bin folder in the hadoop directory with the bin folder provided at this GitHub link:
https://github.com/s911415/apache-hadoop-3.1.0-winutils
Download it as a zip file, extract it, and copy the bin folder in it.
If you want to keep the old bin folder, rename it to something like bin_old and paste the copied bin folder into that directory.

Check whether Hadoop is successfully installed by running this command on cmd:

hadoop version

Format the NameNode

Formatting the NameNode is done only once, when Hadoop is first installed. Do not format it again for a running Hadoop filesystem, or it will delete all the data inside HDFS. Run this command:

hdfs namenode -format

Now change the directory in cmd to the sbin folder of the hadoop directory:

cd C:\hadoop-3.3.0\sbin

Start the namenode and datanode with this command:

start-dfs.cmd

Two more cmd windows will open, one each for the NameNode and the DataNode.

Now start yarn through this command:

start-yarn.cmd

Note: Make sure all 4 Apache Hadoop Distribution windows are up and running. If they are not running, you will see an error or a shutdown message; in that case, you need to debug the error.

To access information about the resource manager's current jobs, and its successful and failed jobs, go to this link in a browser: http://localhost:8088/cluster

To check the details about HDFS (namenode and datanode): http://localhost:9870/

PRACTICAL NO: 02

Aim: Exploring Hadoop Distributed File System (HDFS)

To implement the following file management tasks in HDFS: adding files and directories, retrieving files, deleting files.

HDFS COMMANDS

Local File System and HDFS (Hadoop Distributed File System)

The local file system is the file system of your own computer. If you are using Windows, the Windows operating system has its own way of managing files and folders; the same applies to Linux. We have also seen that Hadoop has its own way of storing files, and that is called HDFS. We have also seen that the data nodes are independent computers which get their instructions from the Name Node for storing files.
So if you think about it a little, the Name Node is actually using the Hadoop Distributed File System, whereas the Data Nodes use their own file system (i.e. Linux).

HDFS Command line

The two commands that help us interact with HDFS are 'hadoop fs' and 'hdfs dfs'. The only difference is that 'hdfs dfs' deals only with the HDFS file system, while 'hadoop fs' can work with other file systems as well.

1. Creating a directory in HDFS

The 'mkdir' command is used to create a directory in HDFS.
Syntax: hadoop fs -mkdir /<directory-name>
Example: hadoop fs -mkdir /mscit
A directory named 'mscit' is created under the root directory.

2. Copying files from the local file system to HDFS

To copy files from the local file system to HDFS, the 'copyFromLocal' command is used.
Syntax: hadoop fs -copyFromLocal <local-file-path> <hdfs-file-path>
Example: hadoop fs -copyFromLocal C:\hadoop-3.3.0\etc\hadoop\employees.csv /mscit
The above command copies employees.csv from your local file system to the newly created directory 'mscit' in HDFS.

put command
Syntax: hadoop fs -put <local-file-path> <hdfs-file-path>
Example: hadoop fs -put C:\hadoop-3.3.0\etc\hadoop\employees.csv /mscit
The above command does the same thing, i.e. it copies employees.csv from your local file system to the newly created directory 'mscit' in HDFS.

3. Copying files from HDFS to the local file system

To copy files from HDFS to the local file system, the 'get' command is used.
Syntax: hadoop fs -get <hdfs-file-path> <local-file-path>
Example: hadoop fs -get /mscit/employees.csv C:\hadoop-3.3.0\etc\hadoop

4. Viewing a file in HDFS

To view a file in HDFS, the 'cat' command is used.
Syntax: hadoop fs -cat <filename>
Example: hadoop fs -cat /mscit/employees.csv

5. Displaying the contents of a directory in HDFS

To display the contents of a directory in HDFS, the 'ls' command is used.
Syntax: hadoop fs -ls /<directory-name>
Example: hadoop fs -ls /mscit

6.
Deleting all files from a directory in HDFS

To delete files from HDFS, the 'rm' command is used.
Syntax: hadoop fs -rm /<directory-name>/*
Example: hadoop fs -rm /mscit/*

PRACTICAL NO: 03

Aim: Implement an application that stores big data in HBase / MongoDB / Pig using Hadoop / R.

HBase - Standalone mode installation

STEP - 1: Extract the HBase file.
Download from: http://www.apache.org/dyn/closer.lua/hbase/
Extract the file hbase-1.4.7-bin.tar.gz and place it under "C:\hbase-1.4.7" (you can use any preferred location).

STEP - 2: Configure Environment variable.
Set the path for the following environment variable (User Variables) on Windows 10:
HBASE_HOME = C:\hbase-1.4.7
This PC -> Right Click -> Properties -> Advanced System Settings -> Advanced -> Environment Variables.

STEP - 3: Configure System variable.

STEP - 4: Create required folders.
1. Create folder "hbase" under "C:\hbase-1.4.7".
2. Create folder "zookeeper" under "C:\hbase-1.4.7".

STEP - 5: Configure required files.
Next, it is essential to configure two key files with the minimal required details:
• hbase-env.cmd
• hbase-site.xml

1) Edit the file "C:\hbase-1.4.7\conf\hbase-env.cmd", set the JAVA_HOME path as shown below, and save the file.

@rem set JAVA_HOME=c:\apps\java
set JAVA_HOME=C:\PROGRA~1\Java\jdk1.8.0_251
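The extract stops before the second file named in STEP 5, hbase-site.xml, is shown. As a sketch, a minimal standalone hbase-site.xml typically points HBase at the two folders created in STEP 4; the exact paths below are assumptions based on the C:\hbase-1.4.7 location used above and should match your own install location:

```xml
<configuration>
  <!-- Where HBase stores its data in standalone mode
       (the "hbase" folder created in STEP 4) -->
  <property>
    <name>hbase.rootdir</name>
    <value>file:///C:/hbase-1.4.7/hbase</value>
  </property>
  <!-- Where the embedded ZooKeeper keeps its data
       (the "zookeeper" folder created in STEP 4) -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>C:/hbase-1.4.7/zookeeper</value>
  </property>
</configuration>
```

With both files saved, HBase can then be started in standalone mode from the bin folder with start-hbase.cmd.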